SURVEY AND SUMMARY Design and bioinformatics analysis of genome-wide CLIP experiments
Abstract
The past decades have witnessed a surge of discoveries revealing RNA regulation as a central player in cellular processes. RNAs are regulated by RNA-binding proteins (RBPs) at all post-transcriptional stages, including splicing, transportation, stabilization and translation. Defects in the functions of these RBPs underlie a broad spectrum of human pathologies. Systematic identification of RBP functional targets is among the key biomedical research questions and provides a new direction for drug discovery. The advent of cross-linking immunoprecipitation coupled with high-throughput sequencing (genome-wide CLIP) technology has recently enabled the investigation of genome-wide RBP–RNA binding at single base-pair resolution. This technology has evolved through the development of three distinct versions: HITS-CLIP, PAR-CLIP and iCLIP. Meanwhile, numerous bioinformatics pipelines for handling genome-wide CLIP data have also been developed. In this review, we discuss the genome-wide CLIP technology and focus on bioinformatics analysis. Specifically, we compare the strengths and weaknesses, as well as the scopes, of various bioinformatics tools. To assist readers in choosing optimal procedures for their analysis, we also review experimental design and procedures that affect bioinformatics analyses.

*To whom correspondence should be addressed. Tel: +1 214 648 5178; Fax: +1 214 648 5120; Email: [email protected]
© The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research Advance Access published May 9, 2015

INTRODUCTION

The diversity of RNA in sequence and structure underpins much of cell heterogeneity and complexity. RNA-binding proteins (RBPs) are proteins that bind to double- or single-stranded RNAs in cells and form ribonucleoprotein complexes with the bound RNAs. Located in the nucleus, the cytoplasm, or both, they engage in every step of post-transcriptional processing, including alternative splicing, regulation of mRNA levels, transport between cellular compartments, alternative polyadenylation and transcript stability (1,2). For example, the TIAR protein has been shown to be transported from the nucleus to the cytoplasm during Fas-mediated apoptotic cell death (3). One example of an intra-nuclear RBP is Yra1p, which has been found to be involved in mRNA export (4). Cytoplasmic RBPs, on the other hand, include Unr, which has been shown to be required for internal initiation of translation of human rhinovirus RNA (5).

RBPs bind target RNAs by recognizing their sequences and/or RNA secondary structures through RNA-binding motifs. For example, the AUF1 protein recognizes RNAs through a signature motif composed of 29–39 nt with high A and U contents and a secondary structure specific to the RNAs (6). Binding of RBPs to RNA targets can also be regulated through competition with other RBPs and noncoding RNAs (7,8). RBPs may influence the global coordination of gene expression by organizing nascent groups of RNAs into downstream chains of the post-transcriptional modification process, through what is known as the ‘RNA operon’ theory (9).

RBPs have been implicated in various types of human diseases (1,10–13). For instance, the RBP Musashi1 was found to be related to many cancer types, including breast and colon cancers, medulloblastoma and glioblastoma, as well as to neurogenesis and neurodegenerative diseases (13). In addition, lack of Fragile X mental retardation protein (FMRP) results in a deficiency in human cognition and in premature ovarian insufficiency (14), and the FUS, EWSR1 and TAF15 (FET) protein family is responsible for RNA editing and plays important roles in many diseases (15,16). In summary, studying RNA–protein interactions is necessary to achieve a systematic understanding of transcription, translation and other biological processes.

CLIP (cross-linking immunoprecipitation) is a molecular biology technology that employs ultraviolet (UV) cross-linking and immunoprecipitation in order to identify RBP–RNA interactions (17,18). The advantage of CLIP is that it identifies interactions formed within cells (where the cross-linking occurs) rather than interactions that might occur after cells are lysed. CLIP therefore increases the confidence that observed interactions are physiologically relevant and can better justify identification of candidates for experimental validation. In early reports, CLIPed cDNAs were sequenced in a low-throughput manner that yielded a few hundred sequence reads. Recently, next-generation sequencing (NGS) techniques have been applied to the global analysis of transcriptional and post-transcriptional regulation, including mRNA sequencing (19), alternative splicing (20) and miRNA profiling (21). The combination of CLIP with NGS technology has greatly improved our ability to study RBP–RNA interactions on the genome scale (22). While earlier genome-wide CLIP studies focused more on the binding of RBPs to mRNAs, recent studies have implicated a wide range of regulatory functions of RBP binding sites in long noncoding RNA (lncRNA) (23), circular RNA (24) and mitochondrial RNA (25).

In this review, we first describe the general procedure and then compare current genome-wide CLIP technologies. Next, we discuss the major experimental design and bioinformatics analysis considerations. Finally, we provide an overview of the current analysis software and databases for genome-wide CLIP data.

Current genome-wide CLIP technologies

There are three major technologies for genome-wide CLIP experiments: (i) HITS-CLIP (high-throughput sequencing of RNA isolated by crosslinking immunoprecipitation) (22,26), which is the first version of genome-wide CLIP-Seq technology; (ii) Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation (PAR-CLIP) (27), which improved the signal-to-noise ratio of the characteristic mutations observed in sequencing data by using nucleoside analogs; and (iii) Individual-nucleotide resolution CLIP (iCLIP) (28), which achieved a much higher efficiency in reverse transcription compared with HITS-CLIP and PAR-CLIP. Throughout this text, we use genome-wide CLIP as a generic name for HITS-CLIP, PAR-CLIP and iCLIP. The field of RNA regulation has seen rapid growth for all versions of genome-wide CLIP technology (Figure 1). In general, genome-wide CLIP technology involves cross-linking, partial RNA digestion, ...
Similar resources
A robust clustering algorithm for identifying problematic samples in genome-wide association studies
SUMMARY High-throughput genotyping arrays provide an efficient way to survey single nucleotide polymorphisms (SNPs) across the genome in large numbers of individuals. Downstream analysis of the data, for example in genome-wide association studies (GWAS), often involves statistical models of genotype frequencies across individuals. The complexities of the sample collection process and the potent...
Conserved RNA secondary structures in viral genomes: A survey
SUMMARY The genomes of RNA viruses often carry conserved RNA structures that perform vital functions during the life cycle of the virus. Such structures can be detected using a combination of structure prediction and co-variation analysis. Here we present results from pilot studies on a variety of viral families performed during bioinformatics computer lab courses in past years.
Common statistical issues in genome-wide association studies: a review on power, data quality control, genotype calling and population structure.
PURPOSE OF REVIEW Genetic association studies which survey the entire genome have become a common design for uncovering the genetic basis of common diseases, including lipid-related traits. Such studies have identified several novel loci which influence blood lipids. The present review highlights the statistical challenges associated with such large-scale genetic studies and discusses the avail...
SURVEY AND SUMMARY Physico-chemical foundations underpinning microarray and next-generation sequencing experiments
Hybridization of nucleic acids on solid surfaces is a key process involved in high-throughput technologies such as microarrays and, in some cases, next-generation sequencing (NGS). A physical understanding of the hybridization process helps to determine the accuracy of these technologies. The goal of a widespread research program is to develop reliable transformations between the raw signals rep...
pcaGoPromoter - An R Package for Biological and Regulatory Interpretation of Principal Components in Genome-Wide Gene Expression Data
Analyzing data obtained from genome-wide gene expression experiments is challenging due to the quantity of variables, the need for multivariate analyses, and the demands of managing large amounts of data. Here we present the R package pcaGoPromoter, which facilitates the interpretation of genome-wide expression data and overcomes the aforementioned problems. In the first step, principal compone...